Abstract



We solve Talagrand's entropy problem: the $L_2$-covering numbers of every uniformly bounded class of functions are exponential in its shattering dimension. This extends Dudley's theorem on classes of $\{0,1\}$-valued functions, for which the shattering dimension is the Vapnik-Chervonenkis dimension.

In convex geometry, the solution means that the entropy of a convex body $K$ is controlled by the maximal dimension of a cube of a fixed side contained in the coordinate projections of $K$. This has a number of consequences, including the optimal Elton's Theorem and estimates on the uniform central limit theorem in the real valued case.

1 Introduction 

The fact that the covering numbers of a set are exponential in its linear algebraic dimension is fundamental and simple. Let $A$ be a class of functions bounded by 1, defined on a set $\Omega$. If $A$ is a finite dimensional class then for every probability measure $\mu$ on $\Omega$,

$$N(A, t, L_2(\mu)) \le \left(\frac{3}{t}\right)^{\dim(A)}, \qquad 0 < t < 1, \tag{1}$$

where $\dim(A)$ is the linear algebraic dimension of $A$ and the left-hand side of (1) is the covering number of $A$, the minimal number of functions needed to approximate any function in $A$ within an error $t$ in the $L_2(\mu)$-norm. This inequality follows by a simple volumetric argument (see e.g. [P], Lemma 4.10) and is, in a sense, optimal: the dependence both on $t$ and on the dimension is sharp (except, perhaps, for the constant 3).

The linear algebraic dimension of $A$ is often too large for (1) to be useful, as it does not capture the "size" of $A$ in different directions but only determines in how many directions $A$ does not vanish. The aim of this paper is to replace the linear algebraic dimension by a combinatorial dimension originating from the classical works of Vapnik and Chervonenkis [VC 71], [VC 81].

We say that a subset $\sigma$ of $\Omega$ is $t$-shattered by a class $A$ if there exists a level function $h$ on $\sigma$ such that, given any subset $\sigma'$ of $\sigma$, one can find a function $f \in A$ with $f(x) \le h(x)$ if $x \in \sigma'$ and $f(x) \ge h(x) + t$ if $x \in \sigma \setminus \sigma'$. The shattering dimension of $A$, denoted by $\mathrm{vc}(A, t)$ after Vapnik and Chervonenkis, is the maximal cardinality of a set $t$-shattered by $A$. Clearly, the shattering dimension does not exceed the linear algebraic dimension, and is often much smaller. Our main result states that the linear algebraic dimension in (1) can be essentially replaced by the shattering dimension.
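For a finite class on a finite domain, the shattering dimension can be computed by brute force, which may help in experimenting with the definition. The sketch below is only an illustration and not part of the argument: the names `is_t_shattered` and `shattering_dimension` are hypothetical, functions are represented as tuples of values, and the search over levels is restricted to values taken by the class (lowering $h(x)$ to the largest class value not exceeding it preserves both conditions of the definition).

```python
from itertools import combinations, product


def is_t_shattered(funcs, sigma, t):
    """Check whether the coordinates in `sigma` are t-shattered by `funcs`."""
    # Candidate levels: values taken by the class at each coordinate.
    levels = [sorted({f[x] for f in funcs}) for x in sigma]
    for h in product(*levels):
        # low[j] == True means sigma[j] lies in sigma' (f must stay below h).
        if all(
            any(
                all(
                    f[x] <= h[j] if low[j] else f[x] >= h[j] + t
                    for j, x in enumerate(sigma)
                )
                for f in funcs
            )
            for low in product([True, False], repeat=len(sigma))
        ):
            return True
    return False


def shattering_dimension(funcs, t):
    """Largest cardinality of a coordinate set t-shattered by `funcs`."""
    n = len(funcs[0])
    best = 0
    for size in range(1, n + 1):
        if any(is_t_shattered(funcs, s, t) for s in combinations(range(n), size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best


cube = [(0, 0), (0, 1), (1, 0), (1, 1)]   # all {0,1}-valued functions on two points
diagonal = [(0, 0), (1, 1)]               # a one-dimensional family
print(shattering_dimension(cube, 1), shattering_dimension(diagonal, 1))
```

The search is exponential in the size of the domain, so it is meant for toy examples only.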

Theorem 1 Let $A$ be a class of functions bounded by 1, defined on a set $\Omega$. Then for every probability measure $\mu$ on $\Omega$,

$$N(A, t, L_2(\mu)) \le \left(\frac{2}{t}\right)^{K \cdot \mathrm{vc}(A, ct)}, \qquad 0 < t < 1, \tag{2}$$

where $K$ and $c$ are positive absolute constants.

There also exists a (simple) reverse inequality complementing (2): for some measure $\mu$, one has $N(A, t, L_2(\mu)) \ge 2^{K \cdot \mathrm{vc}(A, ct)}$, where $K$ and $c$ are some absolute constants, see e.g. [T 02].

The origins of Theorem 1 are rooted in the work of Vapnik and Chervonenkis, who first understood that entropy estimates are essential in determining whether a class of functions obeys the uniform law of large numbers. The subsequent fundamental works of Koltchinskii [K] and Giné and Zinn [GZ] enhanced the link between entropy estimates and uniform limit theorems (see
also WWj ). 



In 1978, R. Dudley proved Theorem 1 for classes of $\{0,1\}$-valued functions ([Du], see [LT] 14.3). This yielded that a $\{0,1\}$-class obeys the uniform law of large numbers (and even the uniform Central Limit Theorem) if and only if its shattering dimension is finite for $0 < t < 1$. The main difficulty in proving such limit theorems for general classes has been the absence of a uniform entropy estimate of the nature of Theorem 1 ([T 88], [T 92], [T 96], [ABCH], [BL], [T 02]). However, proving Dudley's result for general classes is considerably more difficult due to the lack of the obvious property of the $\{0,1\}$-valued classes, namely that if a set $\sigma$ is $t$-shattered for some $0 < t < 1$ then it is automatically 1-shattered.

In 1992, M. Talagrand proved a weaker version of Theorem 1: under some mild regularity assumptions, $\log N(A, t, L_2(\mu)) \le K \cdot \mathrm{vc}(A, ct) \log^M(2/t)$, where $K$, $c$ and $M$ are some absolute constants ([T 92], [T 02]). Theorem 1 is Talagrand's inequality with the best possible exponent $M = 1$ (and without regularity assumptions).

Talagrand's inequality was motivated not only by limit theorems in probability, but to a great extent by applications to convex geometry. A subset $B$ of $\mathbb{R}^n$ can be viewed as a class of real valued functions on $\{1, \dots, n\}$. If $B$ is convex and, for simplicity, symmetric, then its shattering dimension $\mathrm{vc}(B, t)$ is the maximal cardinality of a subset $\sigma$ of $\{1, \dots, n\}$ such that $P_\sigma(B) \supseteq [-\frac{t}{2}, \frac{t}{2}]^\sigma$, where $P_\sigma$ denotes the orthogonal projection in $\mathbb{R}^n$ onto $\mathbb{R}^\sigma$. In the general, non-symmetric, case we allow translations of the cube $[-\frac{t}{2}, \frac{t}{2}]^\sigma$ by a vector in $\mathbb{R}^\sigma$.

The following entropy bound for convex bodies is then an immediate consequence of Theorem 1. Recall that $N(B, D)$ is the covering number of $B$ by a set $D$ in $\mathbb{R}^n$, the minimal number of translates of $D$ needed to cover $B$.

Corollary 2 There exist positive absolute constants $K$ and $c$ such that the following holds. Let $B$ be a convex body contained in $[0, 1]^n$, and $D_n$ be the unit Euclidean ball in $\mathbb{R}^n$. Then for $0 < t < 1$

$$N(B, t\sqrt{n}\, D_n) \le \left(\frac{2}{t}\right)^{Kd},$$

where $d$ is the maximal cardinality of a subset $\sigma$ of $\{1, \dots, n\}$ such that $P_\sigma(B) \supseteq h + [0, ct]^\sigma$ for some vector $h$ in $\mathbb{R}^n$.



As M. Talagrand notices in [T 02], Theorem 1 is a "concentration of pathology" phenomenon. Assume one knows that a covering number of the class $A$ is large. All this means is that $A$ contains many well separated functions, but it tells nothing about the structure these functions form. The conclusion of Theorem 1 is that $A$ must shatter a large set $\sigma$, which detects a very accurate pattern: one can find functions in $A$ oscillating on $\sigma$ in all possible $2^{|\sigma|}$ ways around fixed levels. The "largeness" of $A$, a priori diffused, is a fortiori concentrated on the set $\sigma$.

The same phenomenon is seen in Corollary 2: given a convex body $B$ with large entropy, one can find an entire cube in a coordinate projection of $B$, the cube that certainly witnesses the entropy's largeness.

When dualized, Corollary 2 solves the problem of finding the best asymptotics in Elton's Theorem. Let $x_1, \dots, x_n$ be vectors in the unit ball of a Banach space, and $\varepsilon_1, \dots, \varepsilon_n$ be Rademacher random variables (independent Bernoulli random variables taking values 1 and $-1$ with probability 1/2). By the triangle inequality, the expectation $\mathbb{E}\|\sum_{i=1}^n \varepsilon_i x_i\|$ is at most $n$, and assume that $\mathbb{E}\|\sum_{i=1}^n \varepsilon_i x_i\| \ge \delta n$ for some number $\delta > 0$.

In 1983, J. Elton [E] proved an important result that there exists a subset $\sigma$ of $\{1, \dots, n\}$ of size proportional to $n$ such that the set of vectors $(x_i)_{i \in \sigma}$ is equivalent to the $\ell_1$ unit-vector basis. Specifically, there exist numbers $s, t > 0$, depending only on $\delta$, such that

$$|\sigma| \ge s^2 n \quad\text{and}\quad \Big\|\sum_{i \in \sigma} a_i x_i\Big\| \ge t \sum_{i \in \sigma} |a_i| \quad\text{for all real numbers } (a_i). \tag{3}$$

Several steps have been made towards finding the best possible $s$ and $t$ in Elton's Theorem. A trivial upper bound is $s, t \le \delta$, which follows from the example of identical vectors and by shrinking the usual $\ell_1$ unit-vector basis. As for the lower bounds, J. Elton proved (3) with $s \sim \delta/\log(1/\delta)$ and $t \sim \delta^3$. A. Pajor [Paj] removed the logarithmic factor from $s$. M. Talagrand [T 92], using his inequality discussed above, improved $t$ to $\delta/\log^M(1/\delta)$. In the present paper, we use Corollary 2 to solve this problem by proving the optimal asymptotics: $s, t \sim \delta$.

Theorem 3 Let $x_1, \dots, x_n$ be vectors in the unit ball of a Banach space, satisfying

$$\mathbb{E}\Big\|\sum_{i=1}^n \varepsilon_i x_i\Big\| \ge \delta n \quad\text{for some number } \delta > 0.$$

Then there exists a subset $\sigma \subset \{1, \dots, n\}$ of cardinality $|\sigma| \ge c\delta^2 n$ such that

$$\Big\|\sum_{i \in \sigma} a_i x_i\Big\| \ge c\delta \sum_{i \in \sigma} |a_i| \quad\text{for all real numbers } (a_i),$$

where $c$ is a positive absolute constant.

Furthermore, there is an interplay between the size of $\sigma$ and the isomorphism constant: they cannot attain their worst possible values together. Namely, we prove that $s$ and $t$ in (3) satisfy, in addition to $s, t \gtrsim \delta$, also the lower bound $s \cdot t \log^{1.6}(2/t) \gtrsim \delta$, which, as an easy example shows, is optimal for all $\delta$ within the logarithmic factor. The power 1.6 can be replaced by any number greater than 1.5. This estimate improves one of the main results of the paper [T 92], where this phenomenon in Elton's Theorem was discovered and proved with a constant (unspecified) power of logarithm.

The paper is organized as follows. In the remaining part of the introduction we sketch the proof of Theorem 1; the complete proof will occupy Section 2. Section 3 is devoted to applications to Elton's Theorem and to empirical processes.

Here is a sketch of the proof of Theorem 1. Starting with a set $A$ which is separated with respect to the $L_2(\mu)$-norm, it is possible to find a coordinate $\omega \in \Omega$ (selected randomly) on which $A$ is diffused, i.e. the values $\{f(\omega),\ f \in A\}$ are spread in the interval $[-1, 1]$. Then there exist two nontrivial subsets $A_1$ and $A_2$ of $A$ with their sets of values $\{f(\omega),\ f \in A_1\}$ and $\{f(\omega),\ f \in A_2\}$ well separated from each other on the line. Continuing this process of separation for $A_1$ and $A_2$, etc., one can construct a dyadic tree of subsets of $A$, called a separating tree, with at least $|A|^{1/2}$ leaves. The "largeness" of the class $A$ is thus captured by its separating tree.

The next step evolved from a beautiful idea in [ABCH]. First, there is no loss of generality in discretizing the class: one can assume that $\Omega$ is finite (say $|\Omega| = n$) and that the functions in $A$ take values in $\frac{t}{6}\mathbb{Z} \cap [-1, 1]$. Then, instead of producing a large set $\sigma$ shattered by $A$ with a certain level function $h$, one can count the number of different pairs $(\sigma, h)$ for which $\sigma$ is shattered by $A$ with the level function $h$. If this number exceeds $\sum_{k=0}^{d} \binom{n}{k} (\frac{12}{t})^k$ then there must exist a set $\sigma$ of size $|\sigma| > d$ shattered by $A$ (because there are $\binom{n}{k}$ possible sets $\sigma$ of cardinality $k$, and for such a set there are at most $(\frac{12}{t})^k$ possible level functions).

The only thing remaining is to bound below the number of pairs $(\sigma, h)$ for which $\sigma$ is shattered by $A$ with a level function $h$. One can show that this number is bounded below by the number of the leaves in the separating tree of $A$, which is $|A|^{1/2}$. This implies that $|A|^{1/2} \le \sum_{k=0}^{d} \binom{n}{k} (\frac{12}{t})^k \le (\frac{Cn}{td})^d$, where $d = \mathrm{vc}(A, ct)$. The ratio $\frac{n}{d}$ can be eliminated from this estimate by a probabilistic extraction principle which reduces the cardinality of $\Omega$.

2 The Proof of Theorem 1

For $t > 0$, a pair of functions $f$ and $g$ on $\Omega$ is $t$-separated in $L_2(\mu)$ if $\|f - g\|_{L_2(\mu)} > t$. A set of functions is called $t$-separated if every pair of distinct points in the set is $t$-separated. Let $N_{\mathrm{sep}}(A, t, L_2(\mu))$ denote the maximal cardinality of a $t$-separated subset of $A$. It is standard and easily seen that

$$N(A, t, L_2(\mu)) \le N_{\mathrm{sep}}(A, t, L_2(\mu)) \le N(A, \tfrac{t}{2}, L_2(\mu)).$$

This inequality shows that in the proof of Theorem 1 we may assume that $A$ is $t$-separated in the $L_2(\mu)$ norm, and replace its covering number by its cardinality.
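The left-hand inequality between covering and separation numbers rests on the observation that a maximal $t$-separated subset is automatically a $t$-net. The following sketch illustrates this on a small finite class; the helper names (`l2_dist`, `greedy_separated_subset`, `is_net`) and the choice of a uniform measure on three points are assumptions made only for the illustration.

```python
import math
from itertools import product


def l2_dist(f, g, weights):
    """L2(mu) distance for a measure mu with the given point weights."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, f, g)))


def greedy_separated_subset(funcs, t, weights):
    """Greedily pick functions that are pairwise more than t apart."""
    chosen = []
    for f in funcs:
        if all(l2_dist(f, g, weights) > t for g in chosen):
            chosen.append(f)
    return chosen


def is_net(net, funcs, t, weights):
    """True if every function lies within distance t of some net element."""
    return all(any(l2_dist(f, g, weights) <= t for g in net) for f in funcs)


weights = [1 / 3] * 3                         # uniform measure on three points
funcs = list(product([-1, 0, 1], repeat=3))   # a toy class bounded by 1
packing = greedy_separated_subset(funcs, 1.0, weights)
print(len(packing), is_net(packing, funcs, 1.0, weights))
```

Any function rejected by the greedy pass is within distance $t$ of an already chosen one, which is exactly why the resulting packing covers the class.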

We will need two probabilistic results, the first of which is straightforward. 

Lemma 4 Let $X$ be a random variable and $X'$ be an independent copy of $X$. Then

$$\mathbb{E}|X - X'|^2 = 2\,\mathbb{E}|X - \mathbb{E}X|^2 = 2 \inf_a \mathbb{E}|X - a|^2.$$
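The identity of Lemma 4 is easy to verify exactly on a finite distribution. The sketch below does so with rational arithmetic; the function name `lemma4_quantities` and the sample distribution are hypothetical choices made for illustration.

```python
from fractions import Fraction


def lemma4_quantities(values, probs):
    """Return (E|X - X'|^2, Var X) for a finite distribution, computed exactly."""
    mean = sum(p * v for v, p in zip(values, probs))
    variance = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))
    # X' is an independent copy of X, so the joint law is the product law.
    pair_moment = sum(
        p * q * (v - w) ** 2
        for v, p in zip(values, probs)
        for w, q in zip(values, probs)
    )
    return pair_moment, variance


def second_moment_about(a, values, probs):
    """E|X - a|^2, which is minimized at a = E X."""
    return sum(p * (v - a) ** 2 for v, p in zip(values, probs))


values = [Fraction(0), Fraction(1), Fraction(3)]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
pair_moment, variance = lemma4_quantities(values, probs)
print(pair_moment, variance)
```

Using `Fraction` avoids rounding, so the equality $\mathbb{E}|X - X'|^2 = 2\,\mathrm{Var}(X)$ can be checked with `==`.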

The next lemma is a small deviation principle. Denote by $\sigma(X)^2 = \mathbb{E}|X - \mathbb{E}X|^2$ the variance of the random variable $X$.

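Lemma 5 Let $X$ be a random variable with nonzero variance. Then there exist numbers $a \in \mathbb{R}$ and $0 < \beta \le \frac12$, so that letting

$$p_1 = \mathbb{P}\{X > a + \tfrac16 \sigma(X)\} \quad\text{and}\quad p_2 = \mathbb{P}\{X < a - \tfrac16 \sigma(X)\},$$

one has either $p_1 \ge 1 - \beta$ and $p_2 \ge \frac{\beta}{2}$, or $p_2 \ge 1 - \beta$ and $p_1 \ge \frac{\beta}{2}$.

Proof. Recall that a median of $X$ is a number $M_X$ such that $\mathbb{P}\{X \ge M_X\} \ge 1/2$ and $\mathbb{P}\{X \le M_X\} \ge 1/2$; without loss of generality we may assume that $M_X = 0$. Therefore $\mathbb{P}\{X > 0\} = 1 - \mathbb{P}\{X \le 0\} \le 1/2$ and similarly $\mathbb{P}\{X < 0\} \le 1/2$.

By Lemma 4,

$$\sigma(X)^2 \le \mathbb{E}|X|^2 = \int_0^\infty \mathbb{P}\{|X| > \lambda\}\, d\lambda^2 = \int_0^\infty \mathbb{P}\{X > \lambda\}\, d\lambda^2 + \int_0^\infty \mathbb{P}\{X < -\lambda\}\, d\lambda^2, \tag{4}$$

where $d\lambda^2 = 2\lambda\, d\lambda$.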

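Assume that the conclusion of the lemma fails, and let $c$ be any number satisfying $\frac13 < c < \frac{1}{\sqrt8}$. Divide $\mathbb{R}_+$ into intervals $I_k$ of length $c\sigma(X)$ by setting

$$I_k = \big(c\sigma(X)k,\ c\sigma(X)(k+1)\big], \qquad k = 0, 1, 2, \dots$$

and let $\beta_0, \beta_1, \beta_2, \dots$ be the non-negative numbers defined by

$$\mathbb{P}\{X > 0\} = \beta_0 \le 1/2, \qquad \mathbb{P}\{X \in I_k\} = \beta_k - \beta_{k+1}, \qquad k = 0, 1, 2, \dots$$

We claim that

$$\text{for all } k \ge 0, \qquad \beta_{k+1} \le \tfrac12 \beta_k. \tag{5}$$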

Indeed, assume that $\beta_{k+1} > \frac12 \beta_k$ for some $k$ and consider the intervals $J_1 = (-\infty, c\sigma(X)k]$ and $J_2 = (c\sigma(X)(k+1), \infty)$. Then $J_1 = (-\infty, 0] \cup \big(\bigcup_{0 \le l \le k-1} I_l\big)$, so

$$\mathbb{P}\{X \in J_1\} = (1 - \beta_0) + \sum_{0 \le l \le k-1} (\beta_l - \beta_{l+1}) = 1 - \beta_k.$$

Similarly, $J_2 = \bigcup_{l \ge k+1} I_l$ and thus

$$\mathbb{P}\{X \in J_2\} = \sum_{l \ge k+1} (\beta_l - \beta_{l+1}) = \beta_{k+1} > \tfrac12 \beta_k.$$

Moreover, since the sequence $(\beta_k)$ is non-increasing by its definition, $\beta_k \ge \beta_{k+1} > \frac12 \beta_k \ge 0$ and $0 < \beta_k \le \beta_0 \le \frac12$. Since the gap between $J_1$ and $J_2$ has length $c\sigma(X) > \frac13 \sigma(X)$, the conclusion of the lemma would hold with $a$ being the middle point between the intervals $J_1$ and $J_2$ and with $\beta = \beta_k$, which contradicts the assumption that the conclusion of the lemma fails. This proves (5).

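Now, one can apply (5) to estimate the first integral in (4). Note that whenever $\lambda \in I_k$,

$$\mathbb{P}\{X > \lambda\} \le \mathbb{P}\{X > c\sigma(X)k\} = \mathbb{P}\Big(\bigcup_{l \ge k} I_l\Big) = \beta_k.$$

Then

$$\int_0^\infty \mathbb{P}\{X > \lambda\}\, d\lambda^2 \le \sum_{k \ge 0} \int_{I_k} \beta_k \cdot 2\lambda\, d\lambda \le \sum_{k \ge 0} \beta_k \cdot 2c\sigma(X)(k+1)\, \mathrm{length}(I_k). \tag{6}$$

Applying (5) inductively, it is evident that $\beta_k \le (\frac12)^k \beta_0 \le \frac{1}{2^{k+1}}$, and since $\mathrm{length}(I_k) = c\sigma(X)$, (6) is bounded by

$$2c^2\sigma(X)^2 \sum_{k \ge 0} \frac{k+1}{2^{k+1}} = 4c^2\sigma(X)^2 < \tfrac12 \sigma(X)^2.$$

By an identical argument one can show that the second integral in (4) is also bounded by $\frac12 \sigma(X)^2$. Therefore

$$\sigma(X)^2 < \tfrac12 \sigma(X)^2 + \tfrac12 \sigma(X)^2 = \sigma(X)^2,$$

and this contradiction completes the proof. ■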

Constructing a separating tree 

Let $A$ be a finite class of functions on a probability space $(\Omega, \mu)$, which is $t$-separated in $L_2(\mu)$. Throughout the proof we will assume that $|A| > 1$. One can think of the class $A$ itself as a (finite) probability space with the uniform measure on it, that is, each element $x$ in $A$ is assigned probability $\frac{1}{|A|}$.

Lemma 6 Let $A$ be a $t$-separated subset of $L_2(\mu)$. Then, there exist a coordinate $i$ in $\Omega$ and numbers $a \in \mathbb{R}$ and $0 < \beta \le 1/2$, so that setting

$$N_1 = |\{x \in A : x(i) > a + \tfrac{1}{12} t\}| \quad\text{and}\quad N_2 = |\{x \in A : x(i) < a - \tfrac{1}{12} t\}|,$$

one has either $N_1 \ge (1 - \beta)|A|$ and $N_2 \ge \frac{\beta}{2}|A|$, or vice versa.

Proof. Let $x, x'$ be random points in $A$ selected independently according to the uniform (counting) measure on $A$. By Lemma 4,

$$\mathbb{E}\|x - x'\|_{L_2(\mu)}^2 = \mathbb{E}\int_\Omega |x(i) - x'(i)|^2\, d\mu(i) = \int_\Omega \mathbb{E}|x(i) - x'(i)|^2\, d\mu(i) = 2\int_\Omega \mathbb{E}|x(i) - \mathbb{E}x(i)|^2\, d\mu(i) = 2\int_\Omega \sigma(x(i))^2\, d\mu(i), \tag{7}$$

where $\sigma(x(i))^2$ is the variance of the random variable $x(i)$ with respect to the uniform measure on $A$.

On the other hand, with probability $1 - \frac{1}{|A|}$ we have $x \ne x'$ and, whenever this event occurs, the separation assumption on $A$ implies that $\|x - x'\|_{L_2(\mu)} \ge t$. Therefore

$$\mathbb{E}\|x - x'\|_{L_2(\mu)}^2 \ge \Big(1 - \frac{1}{|A|}\Big) t^2 \ge \frac{t^2}{2}$$

provided that $|A| > 1$.

Together with (7) this proves the existence of a coordinate $i \in \Omega$, on which

$$\sigma(x(i)) \ge \frac{t}{2}, \tag{8}$$

and the claim follows from Lemma 5 applied to the random variable $x(i)$. ■

This lemma should be interpreted as a separation lemma for the set $A$. It means that one can always find two nontrivial subsets of $A$ and a coordinate in $\Omega$, on which the two subsets are separated with a "gap" proportional to $t$.

Based on Lemma 6, one can construct a large separating tree in $A$. Recall that a tree of subsets of a set $A$ is a finite collection $T$ of subsets of $A$ such that, for every pair $B, D \in T$, either $B$ and $D$ are disjoint or one of them contains the other. We call $D$ a son of $B$ if $D$ is a maximal (with respect to inclusion) proper subset of $B$ that belongs to $T$. An element of $T$ with no sons is called a leaf.

Definition 7 Let $A$ be a class of functions on $\Omega$ and $t > 0$. A $t$-separating tree $T$ of $A$ is a tree of subsets of $A$ such that every element $B \in T$ which is not a leaf has exactly two sons $B_+$ and $B_-$ and, for some coordinate $i \in \Omega$,

$$f(i) > g(i) + t \quad\text{for all } f \in B_+,\ g \in B_-.$$

Proposition 8 Let $A$ be a finite class of functions on a probability space $(\Omega, \mu)$. If $A$ is $t$-separated with respect to the $L_2(\mu)$ norm, then there exists a $\frac16 t$-separating tree of $A$ with at least $|A|^{1/2}$ leaves.

Proof. By Lemma 6, any finite class $A$ which is $t$-separated with respect to the $L_2(\mu)$ norm has two subsets $A_+$ and $A_-$ and a coordinate $i \in \Omega$ for which $f(i) > g(i) + \frac16 t$ for every $f \in A_+$ and $g \in A_-$. Moreover, there exists some number $0 < \beta \le 1/2$ such that

$$|A_+| \ge (1 - \beta)|A| \quad\text{and}\quad |A_-| \ge \frac{\beta}{2}|A|, \quad\text{or vice versa.}$$

Thus, $A_+$ and $A_-$ are sons of $A$ which are both large and well separated on the coordinate $i$.

The conclusion of the proposition will now follow by induction on the cardinality of $A$. The proposition clearly holds for $|A| = 2$. Assume it holds for every $t$-separated class of cardinality bounded by $N$, and let $A$ be a $t$-separated class of cardinality $N + 1$. Let $A_+$ and $A_-$ be the sons of $A$ as above; since $\beta > 0$, we have $|A_+|, |A_-| \le N$. Moreover, if $A_+$ has a $\frac16 t$-separating tree with $N_+$ leaves and $A_-$ has a $\frac16 t$-separating tree with $N_-$ leaves then, by joining these trees, $A$ has a $\frac16 t$-separating tree with $N_+ + N_-$ leaves, the number bounded below by $|A_+|^{1/2} + |A_-|^{1/2}$ by the induction hypothesis. Since $\beta \le 1/2$,

$$|A_+|^{1/2} + |A_-|^{1/2} \ge \big((1 - \beta)|A|\big)^{1/2} + \Big(\frac{\beta}{2}|A|\Big)^{1/2} = \Big[(1 - \beta)^{1/2} + \Big(\frac{\beta}{2}\Big)^{1/2}\Big] |A|^{1/2} \ge |A|^{1/2},$$

as claimed. ■

The exponent 1/2 has no special meaning in Proposition 8. It can be improved to any number smaller than 1 at the cost of reducing the constant $\frac16$.

Counting shattered sets

As explained in the introduction, our aim is to construct a large set shattered 
by a given class. We will first try to do this for classes of integer-valued 
functions. 

Let $A$ be a class of integer-valued functions on a set $\Omega$. We say that a couple $(\sigma, h)$ is a center if $\sigma$ is a finite subset of $\Omega$ and $h$ is an integer-valued function on $\sigma$. We call the cardinality of $\sigma$ the dimension of the center. For convenience, we introduce (the only) 0-dimensional center $(\emptyset, \emptyset)$, which is the trivial center.

Definition 9 The set $A$ shatters a center $(\sigma, h)$ if the following holds:

• either $(\sigma, h)$ is trivial and $A$ is nonempty,

• or, otherwise, for every choice of signs $\theta \in \{-1, 1\}^\sigma$ there exists a function $f \in A$ such that for $i \in \sigma$

$$f(i) > h(i) \ \text{ when } \theta(i) = 1, \qquad f(i) < h(i) \ \text{ when } \theta(i) = -1. \tag{9}$$

It is crucial that both inequalities in (9) are strict: they ensure that whenever a $d$-dimensional center is shattered by $A$, one has $\mathrm{vc}(A, 2) \ge d$. In fact, it is evident that $\mathrm{vc}(A, 2)$ is the maximal dimension of a center shattered by $A$.

Proposition 10 The number of centers shattered by A is at least the number 
of leaves in any 1-separating tree of A. 

Proof. Given a class $B$ of integer-valued functions, denote by $s(B)$ the number of centers shattered by $B$. It is enough to prove that if $B_+$ and $B_-$ are the sons of an element $B$ of a 1-separating tree in $A$ then

$$s(B) \ge s(B_+) + s(B_-). \tag{10}$$

By the definition of the 1-separating tree, there is a coordinate $i_0 \in \Omega$ such that $f(i_0) > g(i_0) + 1$ for all $f \in B_+$ and $g \in B_-$. Since the functions are integer-valued, there exists an integer $t$ such that

$$f(i_0) > t \ \text{ for } f \in B_+ \quad\text{and}\quad g(i_0) < t \ \text{ for } g \in B_-.$$

If a center $(\sigma, h)$ is shattered either by $B_+$ or by $B_-$, it is also shattered by $B$. Next, assume that $(\sigma, h)$ is shattered by both $B_+$ and $B_-$. Note that in this case $i_0 \notin \sigma$. Indeed, if the converse holds then $\sigma$ contains $i_0$ and hence is nonempty. Thus the center $(\sigma, h)$ is nontrivial and there exist $f \in B_+$ and $g \in B_-$ such that $t < f(i_0) < h(i_0)$ (by (9) with $\theta(i_0) = -1$) and $t > g(i_0) > h(i_0)$ (by (9) with $\theta(i_0) = 1$), which is impossible. Consider the center $(\sigma', h') = (\sigma \cup \{i_0\},\ h \oplus t)$, where $h \oplus t$ is the extension of the function $h$ onto the set $\sigma \cup \{i_0\}$ defined by $(h \oplus t)(i_0) = t$.

Observe that $(\sigma', h')$ is shattered by $B$. Indeed, since $B_+$ shatters $(\sigma, h)$, for every $\theta \in \{-1, 1\}^\sigma \times \{1\}^{\{i_0\}}$ there exists a function $f \in B_+$ such that (9) holds for $i \in \sigma$. Also, since $f \in B_+$, automatically $f(i_0) > t = h'(i_0)$. Similarly, for every $\theta \in \{-1, 1\}^\sigma \times \{-1\}^{\{i_0\}}$ there exists a function $f \in B_-$ such that (9) holds for $i \in \sigma$ and automatically $f(i_0) < t = h'(i_0)$.

Clearly, $(\sigma', h')$ is shattered neither by $B_+$ nor by $B_-$, because $f(i_0) > t = h'(i_0)$ for all $f \in B_+$, so (9) fails if $\theta(i_0) = -1$; a similar argument holds for $B_-$.

Summarizing, $(\sigma, h) \mapsto (\sigma', h')$ is an injective mapping from the set of centers shattered by both $B_+$ and $B_-$ into the set of centers shattered by $B$ but not by $B_+$ or $B_-$, which proves our claim. ■
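The superadditivity of the number of shattered centers under a separating split can be observed directly on a toy class. The sketch below counts centers by brute force; the function names and the four-element class are hypothetical and serve only as an illustration of the counting argument, with integer-valued functions on $\{0, 1\}$ written as tuples.

```python
from itertools import combinations, product


def shatters(funcs, sigma, h):
    """True if `funcs` shatters the center (sigma, h), with strict inequalities."""
    if not sigma:
        return len(funcs) > 0  # the trivial center needs a nonempty class
    for signs in product([1, -1], repeat=len(sigma)):
        if not any(
            all(f[i] > h[j] if s == 1 else f[i] < h[j]
                for j, (i, s) in enumerate(zip(sigma, signs)))
            for f in funcs
        ):
            return False
    return True


def count_shattered_centers(funcs):
    """Number of centers (sigma, h) shattered by a finite integer-valued class."""
    if not funcs:
        return 0
    n = len(funcs[0])
    count = 0
    for size in range(n + 1):
        for sigma in combinations(range(n), size):
            # A shattered level must lie strictly between the extreme values.
            ranges = [
                range(min(f[i] for f in funcs) + 1, max(f[i] for f in funcs))
                for i in sigma
            ]
            count += sum(shatters(funcs, sigma, h) for h in product(*ranges))
    return count


b_plus = [(2, 0), (2, 2)]    # f(0) = 2 for every f in B_+
b_minus = [(0, 0), (0, 2)]   # g(0) = 0 for every g in B_-, so f(0) > g(0) + 1
b = b_plus + b_minus
print(count_shattered_centers(b), count_shattered_centers(b_plus), count_shattered_centers(b_minus))
```

In this example the two sons are separated on coordinate 0, and each center shattered by both sons produces a new center for the parent by adjoining coordinate 0 with the intermediate level.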



Combining Propositions 8 and 10, one bounds from below the number of shattered centers.



Corollary 11 Let $A$ be a finite class of integer-valued functions on a probability space $(\Omega, \mu)$. If $A$ is 6-separated with respect to the $L_2(\mu)$ norm then it shatters at least $|A|^{1/2}$ centers.

To show that there exists a large dimensional center shattered by $A$, one must assume that the class $A$ is bounded in some sense, otherwise one could have infinitely many low dimensional centers shattered by the class. A natural assumption is the uniform boundedness of $A$, under which we conclude a preliminary version of Theorem 1.

Proposition 12 Let $(\Omega, \mu)$ be a probability space, where $\Omega$ is a finite set of cardinality $n$. Assume that $A$ is a class of functions on $\Omega$ into $\{0, 1, \dots, p\}$, which is 6-separated in $L_2(\mu)$. Set $d$ to be the maximal dimension of a center shattered by $A$. Then

$$|A| \le \Big(\frac{pn}{d}\Big)^{Cd}, \tag{11}$$

where $C$ is an absolute constant. In particular, the same assertion holds for $d = \mathrm{vc}(A, 2)$.

Proof. By Corollary 11, $A$ shatters at least $|A|^{1/2}$ centers. On the other hand, the total number of centers whose dimension is at most $d$ that a class of $\{0, 1, \dots, p\}$-valued functions on $\Omega$ can shatter is bounded by $\sum_{k=0}^{d} \binom{n}{k} p^k$. Indeed, for every $k$ there exist at most $\binom{n}{k}$ subsets $\sigma \subset \Omega$ of cardinality $k$ and, for each $\sigma$ with $|\sigma| = k$, there are at most $p^k$ level functions $h$ for which the center $(\sigma, h)$ can be shattered by such a class. Therefore $|A|^{1/2} \le \sum_{k=0}^{d} \binom{n}{k} p^k$ (otherwise there would exist a center of dimension larger than $d$ shattered by $A$, contradicting the maximality of $d$). The proof is completed by approximating the binomial coefficients using Stirling's formula. ■
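The counting bound in this proof is simple to evaluate numerically. The sketch below computes $\sum_{k=0}^{d} \binom{n}{k} p^k$ and compares it with the elementary estimate $(epn/d)^d$, valid for $1 \le d \le n$ and $p \ge 1$; this particular closed form is a standard consequence of $\sum_{k \le d} \binom{n}{k} \le (en/d)^d$ and is used here as an assumption-level stand-in for the Stirling step, not as the exact constant of the paper. The function names are hypothetical.

```python
from math import comb, e


def max_centers(n, d, p):
    """Upper bound on the number of centers of dimension at most d:
    choose sigma with |sigma| = k, then one of at most p levels per point."""
    return sum(comb(n, k) * p ** k for k in range(d + 1))


def crude_upper_bound(n, d, p):
    """Elementary estimate (e * p * n / d) ** d for 1 <= d <= n and p >= 1."""
    return (e * p * n / d) ** d


print(max_centers(4, 2, 3), crude_upper_bound(4, 2, 3))
```

For fixed $p$ and $d$ the bound grows polynomially in $n$, which is the dependence on $n/d$ that the extraction principle is designed to remove.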

Actually, the ratio $n/d$ can be eliminated from (11) (perhaps at the cost of increasing the separation parameter 6). To this end, one needs to reduce the size of $\Omega$ without changing the assumption that the class is "well separated". This is achieved by the following probabilistic extraction principle.

Lemma 13 There is a positive absolute constant $c$ such that the following holds. Let $\Omega$ be a finite set with the uniform probability measure $\mu$ on it. Let $A$ be a class of functions bounded by 1, defined on $\Omega$. Assume that for some $0 < t < 1$

$A$ is $t$-separated with respect to the $L_2(\mu)$ norm.

If $|A| \le \frac12 \exp(ct^4 k)$ for some positive number $k$, there exists a subset $\sigma \subset \Omega$ of cardinality at most $k$ such that

$A$ is $\frac{t}{2}$-separated with respect to the $L_2(\mu_\sigma)$ norm,

where $\mu_\sigma$ is the uniform probability measure on $\sigma$.

As the reader guesses, the set $\sigma$ will be chosen randomly in $\Omega$. We will estimate probabilities using a version of Bernstein's inequality (see e.g. [VW], or [LT] 6.3 for stronger inequalities).



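Lemma 14 (Bernstein's inequality) Let $X_1, \dots, X_n$ be independent random variables with zero mean. Then, for every $u > 0$,

$$\mathbb{P}\Big\{\Big|\sum_{i=1}^n X_i\Big| > u\Big\} \le 2\exp\Big(-\frac{u^2}{2(b^2 + au/3)}\Big),$$

where $a = \sup_i \|X_i\|_\infty$ and $b^2 = \sum_{i=1}^n \mathbb{E}|X_i|^2$.

Proof of Lemma 13. For the sake of simplicity we identify $\Omega$ with $\{1, 2, \dots, n\}$. The difference set $S = \{f - g \mid f \ne g,\ f, g \in A\}$ has cardinality $|S| \le |A|^2$. For each $x \in S$ we have $|x(i)| \le 2$ for all $i \in \{1, \dots, n\}$ and $\sum_{i=1}^n |x(i)|^2 \ge t^2 n$. Fix an integer $k$ satisfying the assumptions of the lemma and let $\delta_1, \dots, \delta_n$ be independent $\{0, 1\}$-valued random variables with $\mathbb{E}\delta_i = \frac{k}{2n} =: \delta$. Then for every $x \in S$

$$\mathbb{P}\Big\{\sum_{i=1}^n \delta_i |x(i)|^2 < \frac{t^2 \delta n}{2}\Big\} \le \mathbb{P}\Big\{\Big|\sum_{i=1}^n (\delta_i - \delta)|x(i)|^2\Big| > \frac{t^2 \delta n}{2}\Big\} \le 2\exp(-c\, t^4 \delta n) \le 2\exp(-ct^4 k),$$

where the value of the absolute constant $c$ may change from one occurrence to the next, and the last line follows from Bernstein's inequality applied to $X_i = (\delta_i - \delta)|x(i)|^2$, for $a = \sup_i \|X_i\|_\infty \le 2$ and

$$b^2 = \sum_{i=1}^n \mathbb{E}|X_i|^2 = \sum_{i=1}^n |x(i)|^4\, \mathbb{E}(\delta_i - \delta)^2 \le 16\delta n.$$

Therefore, by the assumption on $k$,

$$\mathbb{P}\Big\{\exists x \in S : \Big(\frac{1}{k}\sum_{i=1}^n \delta_i |x(i)|^2\Big)^{1/2} < \frac{t}{2}\Big\} \le |S| \cdot 2\exp(-ct^4 k) < 1/2.$$

Moreover, if $\sigma$ is the random set $\{i \mid \delta_i = 1\}$ then by Chebyshev's inequality,

$$\mathbb{P}\{|\sigma| > k\} = \mathbb{P}\Big\{\sum_{i=1}^n \delta_i > k\Big\} \le 1/2,$$

which implies that

$$\mathbb{P}\Big\{\exists x \in S : \|x\|_{L_2(\mu_\sigma)} < \frac{t}{2}\Big\} < 1.$$

This translates into the fact that with positive probability the class $A$ is $\frac{t}{2}$-separated with respect to the $L_2(\mu_\sigma)$ norm. ■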


Proof of Theorem 1. One may clearly assume that $|A| > 1$ and that the functions in $A$ are defined on a finite domain $\Omega$, so that the probability measure $\mu$ on $\Omega$ is supported on a finite number of atoms. Next, by splitting these atoms (by replacing an atom $\omega$ by, say, two atoms $\omega_1$ and $\omega_2$, each carrying measure $\frac12 \mu(\omega)$, and by defining $f(\omega_1) = f(\omega_2) = f(\omega)$ for $f \in A$), one can make the measure almost uniform without changing either the covering numbers or the shattering dimension of $A$. Therefore, assume that the domain $\Omega$ is $\{1, 2, \dots, n\}$ for some integer $n$, and that $\mu$ is the uniform measure on $\Omega$.

Fix $0 < t < 1/2$ and let $A$ be $2t$-separated in the $L_2(\mu)$ norm. By Lemma 13, there is a set of coordinates $\sigma \subset \{1, \dots, n\}$ of size $|\sigma| \le \frac{C \log|A|}{t^4}$ such that $A$ is $t$-separated in $L_2(\mu_\sigma)$, where $\mu_\sigma$ is the uniform probability measure on $\sigma$.

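Let $p = \lfloor 7/t \rfloor$, define $\tilde{A} \subset \{0, 1, \dots, p\}^\sigma$ by

$$\tilde{A} = \Big\{ \Big( \Big\lfloor \frac{7 f(i)}{t} \Big\rfloor \Big)_{i \in \sigma} : f \in A \Big\},$$

and observe that $\tilde{A}$ is 6-separated in $L_2(\mu_\sigma)$. By Proposition 12,

$$|A| = |\tilde{A}| \le \Big(\frac{p|\sigma|}{d}\Big)^{Cd},$$

where $d = \mathrm{vc}(\tilde{A}, 2)$, implying that

$$|A| \le \Big(\frac{C \log|A|}{d t^5}\Big)^{Cd}.$$

By a straightforward computation,

$$|A| \le \Big(\frac{1}{t}\Big)^{Kd},$$

and our claim follows from the fact that $\mathrm{vc}(\tilde{A}, 2) \le \mathrm{vc}(A, t/7)$. ■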



Remark. Theorem 1 also holds for the $L_p(\mu)$ covering numbers for all $1 \le p < \infty$, with constants $K$ and $c$ depending only on $p$. The only minor modification of the proof is in Lemma 4, where the equations would be replaced by appropriate inequalities.

3 Applications: Gaussian Processes and Convexity

The first application is a bound on the expectation of the supremum of a Gaussian process indexed by a set $A$. Such a bound is provided by Dudley's integral in terms of the $L_2$ entropy of $A$; the entropy, in turn, can be majorized through Theorem 1 by the shattering dimension of $A$. The resulting integral inequality improves the main result of M. Talagrand in [T 92].

If $A$ is a class of functions on the finite set $I$, then a natural Gaussian process $(X_a)_{a \in A}$ indexed by elements of $A$ is

$$X_a = \sum_{i \in I} g_i\, a(i),$$

where $g_i$ are independent standard Gaussian random variables.

Theorem 15 Let $A$ be a class of functions bounded by 1, defined on a finite set $I$ of cardinality $n$. Then $E = \mathbb{E}\sup_{a \in A} X_a$ is bounded as

$$E \le K\sqrt{n} \int_{cE/n}^{1} \sqrt{\mathrm{vc}(A, t) \cdot \log(2/t)}\ dt,$$

where $K$ and $c$ are absolute positive constants.

The nonzero lower limit in the integral will play an important role in the 
application to Elton's Theorem. 

The first step in the proof is to view $A$ as a subset of $\mathbb{R}^n$. Dudley's integral inequality can be stated as

$$E \le K \int_0^\infty \sqrt{\log N(A, tD_n)}\ dt,$$

where $D_n$ is the unit Euclidean ball in $\mathbb{R}^n$, see [P] Theorem 5.6. The lower limit in this integral can be improved by a standard argument. This fact was first noticed by A. Pajor.

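Lemma 16 Let $A$ be a subset of $\mathbb{R}^n$. Then $E = \mathbb{E}\sup_{a \in A} X_a$ is bounded as

$$E \le K \int_{cE/\sqrt{n}}^\infty \sqrt{\log N(A, tD_n)}\ dt,$$

where $K$ and $c$ are absolute constants.

Proof. Fix positive absolute constants $c_1, c_2$ whose values will be specified later. There exists a subset $\mathcal{N}$ of $A$, which is a $(\frac{c_1 E}{\sqrt{n}})$-net of $A$ with respect to the Euclidean norm and has cardinality $|\mathcal{N}| \le N(A, \frac{c_1 E}{2\sqrt{n}} D_n)$. Then $A \subset \mathcal{N} + \frac{c_1 E}{\sqrt{n}} D_n$, and one can write

$$E = \mathbb{E}\sup_{a \in A} X_a \le \mathbb{E}\max_{a \in \mathcal{N}} X_a + \mathbb{E}\sup_{a \in \frac{c_1 E}{\sqrt{n}} D_n} X_a. \tag{12}$$

The first summand is estimated by Dudley's integral as

$$\mathbb{E}\max_{a \in \mathcal{N}} X_a \le K \int_0^\infty \sqrt{\log N(\mathcal{N}, tD_n)}\ dt. \tag{13}$$

On the interval $(0, \frac{c_2 E}{\sqrt{n}})$,

$$K \int_0^{c_2 E/\sqrt{n}} \sqrt{\log N(\mathcal{N}, tD_n)}\ dt \le K \frac{c_2 E}{\sqrt{n}} \sqrt{\log |\mathcal{N}|} \le K \frac{c_2 E}{\sqrt{n}} \sqrt{\log N\Big(A, \frac{c_1 E}{2\sqrt{n}} D_n\Big)}.$$

The latter can be estimated using Sudakov's inequality (see e.g. [P]), which states that $\varepsilon\sqrt{\log N(A, \varepsilon D_n)} \le K_1\, \mathbb{E}\sup_{a \in A} X_a$ for all $\varepsilon > 0$. Indeed,

$$K \frac{c_2 E}{\sqrt{n}} \sqrt{\log N\Big(A, \frac{c_1 E}{2\sqrt{n}} D_n\Big)} \le K K_1 (2c_2/c_1)\, \mathbb{E}\sup_{a \in A} X_a = K K_1 (2c_2/c_1)\, E \le \frac14 E,$$

if we select $c_2$ as $c_2 = c_1/(8 K K_1)$. Combining this with (13) implies that

$$\mathbb{E}\max_{a \in \mathcal{N}} X_a \le \frac14 E + K \int_{c_2 E/\sqrt{n}}^\infty \sqrt{\log N(A, tD_n)}\ dt \tag{14}$$

because $\mathcal{N}$ is a subset of $A$.

To bound the second summand in (12), we apply the Cauchy-Schwarz inequality to obtain that for any $t > 0$,

$$\mathbb{E}\sup_{a \in tD_n} X_a \le t \cdot \mathbb{E}\Big(\sum_{i=1}^n g_i^2\Big)^{1/2} \le t\sqrt{n}.$$

In particular, if $c_1 \le 1/4$ then

$$\mathbb{E}\sup_{a \in \frac{c_1 E}{\sqrt{n}} D_n} X_a \le c_1 E \le \frac14 E.$$

This, (12) and (14) imply that

$$E \le K_2 \int_{c_2 E/\sqrt{n}}^\infty \sqrt{\log N(A, tD_n)}\ dt,$$

where $K_2$ is an absolute constant. ■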



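Proof of Theorem 15. By Lemma 16,

$$E \le K \int_{cE/\sqrt{n}}^\infty \sqrt{\log N(A, tD_n)}\ dt.$$

Since $A \subset [-1, 1]^n \subset \sqrt{n} D_n$, the integrand vanishes for $t \ge \sqrt{n}$. Hence, by Theorem 1,

$$E \le K \int_{cE/\sqrt{n}}^{\sqrt{n}} \sqrt{\log N(A, tD_n)}\ dt = K\sqrt{n} \int_{cE/n}^{1} \sqrt{\log N(A, t\sqrt{n} D_n)}\ dt \le K_1 \sqrt{n} \int_{cE/n}^{1} \sqrt{\mathrm{vc}(A, c_1 t) \cdot \log(2/t)}\ dt.$$

The absolute constant $0 < c_1 < 1/2$ can be made 1 by a further change of variable. ■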



The main consequence of Theorem 15 is Elton's Theorem with the optimal dependence on $\delta$.

Theorem 17 There is an absolute constant $c$ for which the following holds. Let $x_1, \dots, x_n$ be vectors in the unit ball of a Banach space. Assume that

$$\mathbb{E}\Big\|\sum_{i=1}^n \varepsilon_i x_i\Big\| \ge \delta n \quad\text{for some number } \delta > 0.$$

Then there exist numbers $s, t \in (c\delta, 1)$, and a subset $\sigma \subset \{1, \dots, n\}$ of cardinality $|\sigma| \ge s^2 n$, such that

$$\Big\|\sum_{i \in \sigma} a_i x_i\Big\| \ge t \sum_{i \in \sigma} |a_i| \quad\text{for all real numbers } (a_i). \tag{15}$$

In addition, the numbers $s$ and $t$ satisfy the inequality $s \cdot t \log^{1.6}(2/t) \ge c\delta$.

Before the proof, recall the interpretation of the shattering dimension of convex bodies. If a set $B \subset \mathbb{R}^n$ is convex and symmetric then $\mathrm{vc}(B, t)$ is the maximal cardinality of a subset $\sigma$ of $\{1, \dots, n\}$ such that $P_\sigma(B) \supseteq [-\frac{t}{2}, \frac{t}{2}]^\sigma$. Indeed, every convex symmetric set in $\mathbb{R}^n$ can be viewed as a class of functions on $\{1, \dots, n\}$. If $\sigma$ is $t$-shattered with a level function $h$ then for every $\sigma' \subset \sigma$ there is some $f_{\sigma'}$ such that $f_{\sigma'}(i) \ge h(i) + t$ if $i \in \sigma'$ and $f_{\sigma'} \le h$ on $\sigma \setminus \sigma'$. By selecting for every such $\sigma'$ the function $(f_{\sigma'} - f_{\sigma \setminus \sigma'})/2$ and since the class is convex and symmetric, it follows that $P_\sigma(B) \supseteq [-\frac{t}{2}, \frac{t}{2}]^\sigma$, as claimed.

Taking the polars, this inclusion can be written as $\frac{t}{2}(B^\circ \cap \mathbb{R}^\sigma) \subset B_1^\sigma$, where $B_1^\sigma$ is the unit ball of $\ell_1^\sigma$. Denoting by $\|\cdot\|_{B^\circ}$ the Minkowski functional (the norm) induced by the body $B^\circ$, one can rewrite this inclusion as the inequality

$$\Big\|\sum_{i \in \sigma} a_i e_i\Big\|_{B^\circ} \ge \frac{t}{2} \sum_{i \in \sigma} |a_i| \quad\text{for all real numbers } (a_i),$$

where $(e_i)$ is the standard basis of $\mathbb{R}^n$. Therefore, to prove Theorem 17 one needs to bound below the shattering dimension of the dual ball of a given Banach space.



Proof of Theorem 17. By a perturbation argument, one may assume that the vectors $(x_i)_{i \le n}$ are linearly independent. Hence, using an appropriate linear transformation one can assume that $X = (\mathbb{R}^n, \|\cdot\|)$ and that $(x_i)_{i \le n}$ are the unit coordinate vectors $(e_i)_{i \le n}$ in $\mathbb{R}^n$. Let $B = (B_X)^\circ$ and note that the assumption $\|e_i\|_X \le 1$ implies that $B \subset [-1, 1]^n$.

Set

$$E = \mathbb{E}\Big\|\sum_{i=1}^n g_i x_i\Big\|_X = \mathbb{E}\sup_{b \in B} \sum_{i=1}^n g_i\, b(i).$$

By Theorem 15,

$$\delta n \le E \le K\sqrt{n} \int_{c\delta}^{1} \sqrt{\mathrm{vc}(B, t) \cdot \log(2/t)}\ dt.$$

Consider the function

$$h(t) = \frac{c_0}{t \log^{1.1}(2/t)},$$

where the absolute constant $c_0 > 0$ is chosen so that $\int_0^1 h(t)\, dt = 1$. It follows that there exists some $c\delta \le t \le 1$ such that

$$\sqrt{\mathrm{vc}(B, t)/n \cdot \log(2/t)} \ge \delta h(t).$$

Hence

$$\mathrm{vc}(B, t) \ge \frac{c_0^2\, \delta^2}{t^2 \log^{3.2}(2/t)}\, n.$$

Therefore, letting $s^2 = \mathrm{vc}(B, t)/n$, it follows that $s \cdot t \log^{1.6}(2/t) \ge c_0 \delta$ as required, and by the discussion preceding the proof there exists a subset $\sigma$ of $\{1, \dots, n\}$ of cardinality $|\sigma| \ge s^2 n$ such that (15) holds with $t/2$ instead of $t$. The only thing remaining is to check that $s \gtrsim \delta$. Indeed, $s \ge \frac{c_0 \delta}{t \log^{1.6}(2/t)} \ge c_1 \delta$, because $t \le 1$. ■

Remarks. 1. As the proof shows, the exponent 1.6 can be reduced to any number larger than 3/2.

2. The relation between $s$ and $t$ in Theorem 17 is optimal up to a logarithmic factor for all $0 < \delta < 1$. This is seen from the following example, shown to us by Mark Rudelson. For $0 < \delta < 1/\sqrt{n}$, the constant vectors $x_i = \delta\sqrt{n} \cdot e_1$ in $X = \mathbb{R}$ show that $st$ in Theorem 17 cannot exceed $\delta$. For $1/\sqrt{n} \le \delta \le 1$, we consider the body $D = \mathrm{conv}(B_1^n \cup \frac{1}{\delta\sqrt{n}} D_n)$ and let $X = (\mathbb{R}^n, \|\cdot\|_D)$ and $x_i = e_i$, $i = 1, \dots, n$. Clearly, $\mathbb{E}\|\sum \varepsilon_i x_i\|_X \ge \mathbb{E}\|\sum \varepsilon_i e_i\|_D = \delta n$. Let $0 < s, t < 1$ be so that (15) holds for some subset $\sigma \subset \{1, \dots, n\}$ of cardinality $|\sigma| \ge s^2 n$. This means that $\|x\|_D \ge t\|x\|_1$ for all $x \in \mathbb{R}^\sigma$. Dualizing, $\frac{t}{\delta\sqrt{n}}\|x\|_2 \le t\|x\|_{D^\circ} \le \|x\|_\infty$ for all $x \in \mathbb{R}^\sigma$. Testing this inequality for $x = \sum_{i \in \sigma} e_i$, it is evident that $\frac{t}{\delta\sqrt{n}}\sqrt{|\sigma|} \le 1$ and thus $st \le \delta$.

We end this article with an application to empirical processes. A key question is when a class of functions satisfies the central limit theorem uniformly in some sense. Such classes of functions are called uniform Donsker classes. We will not define these classes formally but rather refer the reader to [VW] for an introduction on the subject. It turns out that the uniform Donsker property is related to uniform estimates on covering numbers via the Koltchinskii-Pollard entropy integral.

Theorem 18 [GZ] Let $F$ be a class of functions bounded by 1. If

$$\int_0^\infty \sup_n \sup_{\mu_n} \sqrt{\log N(F, L_2(\mu_n), \varepsilon)}\ d\varepsilon < \infty,$$

then $F$ is a uniform Donsker class.

Having this condition in mind, it is natural to try to seek entropy estimates which are "dimension free", that is, do not depend on the size of the sample. In the $\{0, 1\}$-valued case, such bounds were first obtained by Dudley, who proved Theorem 1 for these classes (see [LT] Theorem 14.13), which implied through Theorem 18 that every VC class is a uniform Donsker class.

Theorem 1 solves the general case: the following corollary extends Dudley's result on the uniform Donsker property from $\{0, 1\}$ classes to classes of real valued functions.



Corollary 19 Let $F$ be a class of functions bounded by 1 and assume that the integral

$$\int_0^1 \sqrt{\mathrm{vc}(F, t) \log\frac{2}{t}}\ dt$$

converges. Then $F$ is a uniform Donsker class.

In particular this shows that if $\mathrm{vc}(F, t)$ is "slightly better" than $1/t^2$, then $F$ is a uniform Donsker class.

This result has an advantage over Theorem 18 because in many cases it is easier to compute the shattering dimension of the class rather than its entropy (see, e.g. [AB]).